AITopics | text decoder

Collaborating Authors

text decoder

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Hierarchical Motion Captioning Utilizing External Text Data Source

Leite, Clayton, Xiao, Yu

arXiv.org Artificial IntelligenceSep-3-2025

This paper introduces a novel approach to enhance existing motion captioning methods, which directly map representations of movement to high-level descriptive captions (e.g., ``a person doing jumping jacks"). The existing methods require motion data annotated with high-level descriptions (e.g., ``jumping jacks"). However, such data is rarely available in existing motion-text datasets, which additionally do not include low-level motion descriptions. To address this, we propose a two-step hierarchical approach. First, we employ large language models to create detailed descriptions corresponding to each high-level caption that appears in the motion-text datasets (e.g., ``jumping while synchronizing arm extensions with the opening and closing of legs" for ``jumping jacks"). These refined annotations are used to retrain motion-to-text models to produce captions with low-level details. Second, we introduce a pioneering retrieval-based mechanism. It aligns the detailed low-level captions with candidate high-level captions from additional text data sources, and combine them with motion features to fabricate precise high-level captions. Our methodology is distinctive in its ability to harness knowledge from external text sources to greatly increase motion captioning accuracy, especially for movements not covered in existing motion-text datasets. Experiments on three distinct motion-text datasets (HumanML3D, KIT, and BOTH57M) demonstrate that our method achieves an improvement in average performance (across BLEU-1, BLEU-4, CIDEr, and ROUGE-L) ranging from 6% to 50% compared to the state-of-the-art M2T-Interpretable.

caption, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2509.01471

Genre: Research Report > Promising Solution (0.34)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

MERaLiON-AudioLLM: Bridging Audio and Language with Large Language Models

He, Yingxu, Liu, Zhuohan, Sun, Shuo, Wang, Bin, Zhang, Wenyu, Zou, Xunlong, Chen, Nancy F., Aw, Ai Ti

arXiv.org Artificial IntelligenceDec-17-2024

We introduce MERaLiON-AudioLLM (Multimodal Empathetic Reasoning and Learning in One Network), the first speech-text model tailored for Singapore's multilingual and multicultural landscape. Developed under the National Large Language Models Funding Initiative, Singapore, MERaLiON-AudioLLM integrates advanced speech and text processing to address the diverse linguistic nuances of local accents and dialects, enhancing accessibility and usability in complex, multilingual environments. Our results demonstrate improvements in both speech recognition and task-specific understanding, positioning MERaLiON-AudioLLM as a pioneering solution for region specific AI applications. We envision this release to set a precedent for future models designed to address localised linguistic and cultural contexts in a global framework.

arxiv, dataset, language model, (15 more...)

arXiv.org Artificial Intelligence

2412.09818

Country:

Asia > Singapore (0.47)
North America > Canada > Ontario > Toronto (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(3 more...)

Genre: Research Report > New Finding (0.68)

Industry: Social Sector (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Piergiovanni, AJ, Kim, Dahun, Ryoo, Michael S., Noble, Isaac, Angelova, Anelia

arXiv.org Artificial IntelligenceNov-21-2024

Generating automatic dense captions for videos that accurately describe their contents remains a challenging area of research. Most current models require processing the entire video at once. Instead, we propose an efficient, online approach which outputs frequent, detailed and temporally aligned captions, without access to future frames. Our model uses a novel autoregressive factorized decoding architecture, which models the sequence of visual features for each time segment, outputting localized descriptions and efficiently leverages the context from the previous video segments. This allows the model to output frequent, detailed captions to more comprehensively describe the video, according to its actual local content, rather than mimic the training data. Second, we propose an optimization for efficient training and inference, which enables scaling to longer videos. Our approach shows excellent performance compared to both offline and online methods, and uses 20\% less compute. The annotations produced are much more comprehensive and frequent, and can further be utilized in automatic video tagging and in large-scale video data harvesting.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2411.14688

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Add feedback

Reducing Hallucinations in Vision-Language Models via Latent Space Steering

Liu, Sheng, Ye, Haotian, Xing, Lei, Zou, James

arXiv.org Artificial IntelligenceOct-22-2024

Hallucination poses a challenge to the deployment of large vision-language models (LVLMs) in applications. Unlike in large language models (LLMs), hallucination in LVLMs often arises from misalignments between visual inputs and textual outputs. This paper investigates the underlying mechanisms of hallucination, focusing on the unique structure of LVLMs that distinguishes them from large language models (LLMs). We identify that hallucinations often arise from the sensitivity of text decoders to vision inputs, a natural phenomenon when image encoders and text decoders are pre-trained separately. Inspired by this, we introduce Visual and Textual Intervention (VTI), a novel technique designed to reduce hallucinations by steering latent space representations during inference to enhance the stability of vision features. As a task-agnostic test-time intervention, VTI can be easily applied to any problem without additional cost. Extensive experiments demonstrate that it can effectively reduce hallucinations and outperform baseline methods across multiple metrics, highlighting the critical role of vision feature stability in LVLMs.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2410.15778

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Automatic Medical Report Generation: Methods and Applications

Guo, Li, Tahir, Anas M., Zhang, Dong, Wang, Z. Jane, Ward, Rabab K.

arXiv.org Artificial IntelligenceAug-25-2024

The increasing demand for medical imaging has surpassed the capacity of available radiologists, leading to diagnostic delays and potential misdiagnoses. Artificial intelligence (AI) techniques, particularly in automatic medical report generation (AMRG), offer a promising solution to this dilemma. This review comprehensively examines AMRG methods from 2021 to 2024. It (i) presents solutions to primary challenges in this field, (ii) explores AMRG applications across various imaging modalities, (iii) introduces publicly available datasets, (iv) outlines evaluation metrics, (v) identifies techniques that significantly enhance model performance, and (vi) discusses unresolved issues and potential future research directions. This paper aims to provide a comprehensive understanding of the existing literature and inspire valuable future research.

dataset, mimic-cxr, report generation, (13 more...)

arXiv.org Artificial Intelligence

2408.13988

Country:

Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)
North America > United States > Indiana (0.04)
South America > Peru > Lima Department > Lima Province > Lima (0.04)
(9 more...)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.65)

Industry:

Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.93)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Add feedback

AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

Kim, Jongsuk, Shin, Jiwon, Kim, Junmo

arXiv.org Artificial IntelligenceJul-10-2024

In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions. Leveraging these advancements, we propose AVCap, an Audio-Visual Captioning framework, a simple yet powerful baseline approach applicable to audio-visual captioning. AVCap utilizes audio-visual features as text tokens, which has many advantages not only in performance but also in the extensibility and scalability of the model. AVCap is designed around three pivotal dimensions: the exploration of optimal audio-visual encoder architectures, the adaptation of pre-trained models according to the characteristics of generated text, and the investigation into the efficacy of modality fusion in captioning. Our method outperforms existing audio-visual captioning methods across all metrics and the code is available on https://github.com/JongSuk1/AVCap

arxiv preprint arxiv, caption, decoder, (14 more...)

arXiv.org Artificial Intelligence

2407.07801

Country: Asia > South Korea (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Investigating Decoder-only Large Language Models for Speech-to-text Translation

Huang, Chao-Wei, Lu, Hui, Gong, Hongyu, Inaguma, Hirofumi, Kulikov, Ilia, Mavlyutov, Ruslan, Popuri, Sravya

arXiv.org Artificial IntelligenceJul-3-2024

Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only LLMs to the task of speech-to-text translation (S2TT). We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation. Additionally, we investigate the effects of different parameter-efficient fine-tuning techniques and task formulation. Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data. We also conduct analyses to validate the design choices of our proposed model and bring insights to the integration of LLMs to S2TT.

architecture, speech encoder, translation, (16 more...)

arXiv.org Artificial Intelligence

2407.03169

Country:

Asia > Taiwan (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation

Lan, Zhibin, Niu, Liqiang, Meng, Fandong, Zhou, Jie, Zhang, Min, Su, Jinsong

arXiv.org Artificial IntelligenceJul-3-2024

In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language. In this regard, conventional cascaded methods suffer from issues such as error propagation, massive parameters, and difficulties in deployment and retaining visual characteristics of the input image. Thus, constructing end-to-end models has become an option, which, however, faces two main challenges: 1) the huge modeling burden, as it is required to simultaneously learn alignment across languages and preserve the visual characteristics of the input image; 2) the difficulties of directly predicting excessively lengthy pixel sequences. In this paper, we propose \textit{Translatotron-V(ision)}, an end-to-end IIMT model consisting of four modules. In addition to an image encoder, and an image decoder, our model contains a target text decoder and an image tokenizer. Among them, the target text decoder is used to alleviate the language alignment burden, and the image tokenizer converts long sequences of pixels into shorter sequences of visual tokens, preventing the model from focusing on low-level visual features. Besides, we present a two-stage training framework for our model to assist the model in learning alignment across modalities and languages. Finally, we propose a location-aware evaluation metric called Structure-BLEU to assess the translation quality of the generated images. Experimental results demonstrate that our model achieves competitive performance compared to cascaded models with only 70.9\% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.

decoder, text decoder, translation, (15 more...)

arXiv.org Artificial Intelligence

2407.02894

Country:

Asia > China > Fujian Province > Xiamen (0.04)
North America > United States > Maryland > Baltimore (0.04)
Asia > Taiwan (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Factor-Conditioned Speaking-Style Captioning

Ando, Atsushi, Moriya, Takafumi, Horiguchi, Shota, Masumura, Ryo

arXiv.org Artificial IntelligenceJun-27-2024

This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which disturbs learning speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender, pitch, etc.), and then generates a caption to ensure the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption based on factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms the original caption-based training, and with GtS, it generates more diverse captions while keeping style prediction performance.

caption, speech encoder, style factor, (15 more...)

arXiv.org Artificial Intelligence

2406.1891

Country: Asia > Japan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)

Add feedback

Structural Entities Extraction and Patient Indications Incorporation for Chest X-ray Report Generation

Liu, Kang, Ma, Zhuoqi, Kang, Xiaolu, Zhong, Zhusi, Jiao, Zhicheng, Baird, Grayson, Bai, Harrison, Miao, Qiguang

arXiv.org Artificial IntelligenceMay-22-2024

The automated generation of imaging reports proves invaluable in alleviating the workload of radiologists. A clinically applicable reports generation algorithm should demonstrate its effectiveness in producing reports that accurately describe radiology findings and attend to patient-specific indications. In this paper, we introduce a novel method, \textbf{S}tructural \textbf{E}ntities extraction and patient indications \textbf{I}ncorporation (SEI) for chest X-ray report generation. Specifically, we employ a structural entities extraction (SEE) approach to eliminate presentation-style vocabulary in reports and improve the quality of factual entity sequences. This reduces the noise in the following cross-modal alignment module by aligning X-ray images with factual entity sequences in reports, thereby enhancing the precision of cross-modal alignment and further aiding the model in gradient-free retrieval of similar historical cases. Subsequently, we propose a cross-modal fusion network to integrate information from X-ray images, similar historical cases, and patient-specific indications. This process allows the text decoder to attend to discriminative features of X-ray images, assimilate historical diagnostic information from similar cases, and understand the examination intention of patients. This, in turn, assists in triggering the text decoder to produce high-quality reports. Experiments conducted on MIMIC-CXR validate the superiority of SEI over state-of-the-art approaches on both natural language generation and clinical efficacy metrics.

patient-specific indication, report generation, similar historical case, (9 more...)

arXiv.org Artificial Intelligence

2405.14905

Country:

Asia > China > Shaanxi Province > Xi'an (0.05)
North America > United States (0.05)

Genre: Research Report > Promising Solution (0.69)

Industry:

Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.73)
Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.59)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.54)

Add feedback